The third homework
Author

Lindsay Jones

Published

October 31, 2022

Homework 3

Setup

Code
library(tidyr)
library(alr4)
Loading required package: car
Loading required package: carData
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
library(ggplot2)

Question 1

Code
data(UN11)

1.1

Identify the predictor and the response.

Since we’re studying the dependence of fertility on ppgdp, the predictor is ppgdp and the response is fertility.

1.2

Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

Code
scatterplot(fertility ~ ppgdp, UN11)

The relationship is strongly curved: fertility falls steeply as ppgdp rises from very low values and then levels off, so a straight-line mean function is not plausible.

1.3

Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

Code
scatterplot(log(fertility) ~ log(ppgdp), UN11)

On the log-log scale the points fall roughly along a straight line with a negative slope, so a simple linear regression model is much more plausible.
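The remark in the question about the base of the logarithm can be checked numerically. The sketch below uses simulated data, not UN11; the variable names and the power-law form are assumptions made only for illustration. It compares the correlation and the fitted slope of a log-log regression under natural and base-10 logarithms.

```r
# Simulated data (hypothetical, not UN11): y follows a power law in x with noise
set.seed(1)
x <- exp(runif(100, min = 5, max = 11))               # positive, right-skewed predictor
y <- exp(2.5 - 0.2 * log(x) + rnorm(100, sd = 0.3))   # positive response

# Correlation of the logged variables under two different bases
r_natural <- cor(log(x), log(y))
r_base10  <- cor(log10(x), log10(y))

# Slopes of the two log-log fits
slope_natural <- unname(coef(lm(log(y) ~ log(x)))[2])
slope_base10  <- unname(coef(lm(log10(y) ~ log10(x)))[2])

# Both comparisons should agree up to floating-point error
all.equal(r_natural, r_base10)
all.equal(slope_natural, slope_base10)
```

Changing the base divides both logged variables by the same constant, so the shape of the scatterplot, the correlation, and the log-log slope stay the same; only the axis values and the intercept change.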

Question 2

2.1

How, if at all, does the slope of the prediction equation change?

The slope is multiplied by a factor of 1.33. This is a change of scale, not an additive increase of 1.33.

2.2

How, if at all, does the correlation change?

The correlation does not change. Correlation is unit-free, so multiplying or dividing a variable by a positive constant leaves it the same.
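Both answers can be illustrated with a small simulation. The setup is not restated in this section, so the sketch assumes that a predictor measured in dollars is converted to pounds at 1.33 dollars per pound; the data and the names `income_usd`, `income_gbp`, and `outcome` are hypothetical.

```r
# Hypothetical data: a response that depends linearly on income in dollars
set.seed(2)
income_usd <- runif(50, min = 20000, max = 90000)
outcome    <- 10 + 0.0002 * income_usd + rnorm(50)

# Assumed conversion: 1 pound = 1.33 dollars
income_gbp <- income_usd / 1.33

fit_usd <- lm(outcome ~ income_usd)
fit_gbp <- lm(outcome ~ income_gbp)

# Ratio of the slope in pounds to the slope in dollars
slope_ratio <- unname(coef(fit_gbp)[2] / coef(fit_usd)[2])
slope_ratio

# Correlations before and after the conversion
cor_usd <- cor(outcome, income_usd)
cor_gbp <- cor(outcome, income_gbp)
all.equal(cor_usd, cor_gbp)
```

Under this assumption, a one-pound change corresponds to a 1.33-dollar change, so the slope per pound is 1.33 times the slope per dollar, while the correlation is identical in both versions.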

Question 3

Draw the scatterplot matrix for these data and summarize the information available from these plots. (Hint: Use the pairs() function.)

Code
data(water)
pairs(water)

Stream runoff (BSAAM) has a strong positive correlation with precipitation at OPBPC, OPRC, and OPSLAKE, so precipitation at those sites could potentially be used to predict water supply. At the other sites (APMAM, APSAB, and APSLAKE), the relationship with runoff is weakly positive, if present at all.

Question 4

Create a scatterplot matrix of these five variables. Provide a brief description of the relationships between the five ratings.

Code
data("Rateprof")

# Base R column selection: select() belongs to dplyr, which is not loaded in Setup
rp <- Rateprof[, c("quality", "helpfulness", "clarity", "easiness", "raterInterest")]
pairs(rp)

  • Rater interest appears to have no correlation (or possibly a very weak positive correlation) with any other variable.

  • Quality has a strong positive correlation with helpfulness and clarity, and a weak positive correlation with easiness.

  • Helpfulness has a strong positive correlation with clarity and a weak positive correlation with easiness.

  • Clarity has a weak positive correlation with easiness (easiness has a weak positive correlation with every variable).

Question 5

Code
data("student.survey")

ss <- student.survey

5.1

5.1.a

Code
lm(pi ~ re, data = ss)
Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
response will be ignored
Warning in Ops.ordered(y, z$residuals): '-' is not meaningful for ordered
factors

Call:
lm(formula = pi ~ re, data = ss)

Coefficients:
(Intercept)         re.L         re.Q         re.C  
     3.5253       2.1864       0.1049      -0.6958  

5.1.b

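I could not make my code work for the categorical variables in this particular regression. Both pi and re are stored as ordered factors, and lm() warns that a factor response cannot be treated as numeric, so the coefficients above should not be interpreted.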
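One possible workaround, shown here only as a sketch, is to convert the ordered factors to their integer codes before fitting. This treats the categories as equally spaced, which is an assumption rather than something the data guarantee. The example uses a simulated data frame `toy` with made-up level names, not the student.survey data.

```r
# Hypothetical stand-in for two ordered survey items (level names are made up)
set.seed(3)
re_levels <- c("r1", "r2", "r3", "r4")
pi_levels <- c("p1", "p2", "p3", "p4", "p5")

toy <- data.frame(
  re = factor(sample(re_levels, 60, replace = TRUE),
              levels = re_levels, ordered = TRUE)
)

# Build an ordered response that tends to rise with re, clamped to codes 1..5
pi_code <- pmin(pmax(round(as.numeric(toy$re) + rnorm(60)), 1), 5)
toy$pi  <- factor(pi_levels[pi_code], levels = pi_levels, ordered = TRUE)

# as.numeric() on an ordered factor returns its integer level codes,
# so lm() sees a numeric response and a numeric predictor
fit1 <- lm(as.numeric(pi) ~ as.numeric(re), data = toy)
coef(fit1)
```

With the real data, the same pattern would be `lm(as.numeric(pi) ~ as.numeric(re), data = ss)`, and `summary()` on that fit would then report a single slope with its t test instead of the polynomial contrasts (re.L, re.Q, re.C) shown above.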

5.2

5.2.a

Code
fit2 <- lm(hi ~ tv, data = ss)

plot(hi ~ tv, data = ss)
abline(fit2)

5.2.b

Code
summary(lm(hi ~ tv, data = ss))

Call:
lm(formula = hi ~ tv, data = ss)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

The slope estimate (-0.018) is negative and statistically significant at the 0.05 level (p = 0.0388), but the relationship between hours spent watching TV and high school GPA is weak. R-squared is only 0.072, so TV hours explain about 7% of the variation in GPA, and the plot shows wide scatter around the fitted line.
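The quantities in the summary output are tied together, and a short calculation makes the strength of the relationship easier to see. The sketch below copies the reported estimate, standard error, R-squared, and degrees of freedom from the output above; the variable names are introduced here for illustration.

```r
# Values copied from the summary(lm(hi ~ tv)) output above
estimate  <- -0.018305
std_error <- 0.008658
r_squared <- 0.07156
df_resid  <- 58

# In simple linear regression the correlation is the signed square root of R-squared
r <- sign(estimate) * sqrt(r_squared)

# t statistic for the slope, and its two-sided p-value
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# With one predictor, the F statistic equals the squared t statistic
f_stat <- t_stat^2

round(c(r = r, t = t_stat, p = p_value, F = f_stat), 4)
```

The implied correlation is about -0.27, which is consistent with a weak negative association, and the recomputed t, p, and F values match the printed output up to rounding.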